Introduction

To recap our proposal, we chose to do an NBA-related project and decided to fit a regression model for salary of NBA players, using statistics and other information about the players as covariates.

Our data includes many seasonal statistics for each player, including both simple stats and advanced metrics. We also have access to other personal information for each player, such as team, position, height, etc.

We are modeling based on the 2015-16 NBA season. We know salary is known before the season starts and statistics are created, so it is not a response variable in the traditional sense of a causal effect. We attempt to explore the relationship between salary and the covariates, and aim to prescribe a true mean function, and identify players who may be overperforming or underperforming their contract according to the 2015-16 market; i.e., putting up better or worse statistics than one might expect a player on their salary to do.

<<<<<<< HEAD

Pre-Modeling

A description of the data sources and variables can be found in the previously submitted EDA portion of the project. We proceeded to filter out players who were below the league minimum in salary, as they were exclusively players signed to short term (i.e., 10-day) contracts with wildly volatile data.

=======

Variable Selection

Our data comes from four different datasets. We used three of Riguang Wen's datasets from figshare.com -- , , and . We also used a dataset called uploaded to zenodo.org by Pablo Gomez and Sandra Giral. From these datasets, we considered the following variables.

Variable Description Type Source
Player Name of player Character
Age Age of player Numeric
G Games played Numeric
GS Games started Numeric
MP Minutes played Numeric
PER Player efficiency rating Numeric
PTS Points Numeric
X3PAr 3PA/FGA Numeric
FTr FTA/FGA Numeric
TS True shooting percentage Numeric
ORB Offensive rebounds Numeric
DRB Defensive rebounds Numeric
TRB Total rebounds Numeric
AST Assists Numeric
STL Steals Numeric
BLK Blocks Numeric
TOV Turnovers Numeric
USG Usage percentage Numeric
ORtg Offensive rating Numeric
DRtg Defensive rating Numeric
OWS Offensive win shares Numeric
DWS Defensive win shares Numeric
WS Win shares Numeric
WS.48 Win shares per 48 minutes Numeric
OBPM Offensive box +/- Numeric
DBPM Defensive box +/- Numeric
BPM Box +/- Numeric
VORP Value over replacement player Numeric
Pos Position Factor
Ht Height in inches Numeric
Wt Weight in pounds Numeric
PwrSix Power Six College? Indicator
International International Player? Indicator
Salary Salary in dollars Numeric

Immediately we can recognize that some variables are functions of others and therefore do not need to be considered. Specifically, , so there is no need to include in our model. Similarly, and , so we can exclude and from consideration if we include , , and in our model.

data_init <- read.csv("data/PlayerData.csv")[,-1] %>% mutate(logsal = log(SALARY))
data <- data_init %>% dplyr::select(logsal, Age, G, gs_adj, MP, PER, X3PAr, ftr_adj, orb_adj, DRB., AST., stl_adj, blk_adj, TOV., USG., ORtg, DRtg, ows_adj, dws_adj, WS.48, OBPM, DBPM, vorp_adj, Pos_cat, Height, pts_adj, TS., International, Conference)
attach(data)
>>>>>>> e275a16cbaf1da494b517254c95b62d4cb85c8f6

From our EDA, we concluded that a log transformation of the response variable, salary, would be appropriate based on the skewed distribution.

We also noticed that many of the covariates had skewed distributions, or did not have linear marginal relationships with the response log(salary). We attempted to make these variables approximately normally distirbutied, with an approximately linear marginal relationship with the response. We experimented with log, square root, and squaring transformations, and ended up considering the following transformations in our model:

  • VORP: We min/max normalized this between 0 and 1 to eliminate negative values, and then took the log
  • OWS: We min/max normalized this between 0 and 1 to eliminate negative values, and then took the log
  • DWS: Square root
  • PTS: Square root
  • FTr: Square root
  • BLK: log
  • STL: log
  • GS: log
  • ORB: log
<<<<<<< HEAD

Below we used VORP as an example; we can see the distribution of VORP and its plot against log(salary) pre and post transformation:

It’s not perfect, but the transformed variable appears to be much more appropriate for a linear model than the raw variable.

Variable Selection

Once our predictors are appropriately transformed, we consider all pairwise correlations between covariates as an initial search for possibly collinearity.

corrplot(cor(data[sapply(data, is.numeric)]))

# Which did we remove from here and why?

Four pairs of predictors with noticably large correlations according the the correlation matrix are further investigated. We learn that and have a correlation coefficient of \(0.903\), and have a correlation coefficient of \(0.888\), and have a correlation coefficient of \(0.864\), and finally and have a correlation coefficient of \(-0.760\).

Initial Modeling and Evaluation

We will start by looking at a full model with all possible predictors, as well as a stepwise model search using both AIC and BIC to provide another starting point.

NOTE: Due to the small nature of our dataset, we will build a model using all available data first, while attempting to be careful to avoid overfitting, and then proceed to formal evaluation techniques such as cross validation later in our analysis.

#all_models = regsubsets(SALARY ~ ., force.in = 1, data = data, nbest = max(choose(n_predictors, 0:n_predictors)), really.big = T, nvmax = 13)
null = lm(logsal ~ 1, data = data)
full = lm(logsal ~ ., data = data)
# summary(full)
=======

We can also see that Position has some redundancy and maybe too much granularity; for example "F-C" (forward/center) and "C-F" (center/forward) are treated as different positions by the model when functionally they are the same. We cleaned this up by grouping into "Guards", "Wings", and "Bigs". From the graph it is unclear if there is a significant relationship between the refined position predictor and log(salary).

pos_box = data_init %>% ggplot(aes(x = Pos, y = log(SALARY))) + geom_boxplot() + ggtitle("log(salary) by position (unrefined)")
pos_cat_box = data_init %>% ggplot(aes(x = Pos_cat, y = log(SALARY))) + geom_boxplot() + ggtitle("log(salary) by position (refined)")
grid.arrange(pos_box, pos_cat_box, nrow = 1)

Below we used VORP as an example; we can see the distribution of VORP and its plot against log(salary) pre and post transformation:

It's not perfect, but the transformed variable appears to be much more appropriate for a linear model than the raw variable. Once our predictors are appropriately transformed, we consider all pairwise correlations between covariates as an initial search for possibly collinearity.

corrplot(cor(data[sapply(data, is.numeric)]))

Four pairs of predictors with noticably large correlations according the the correlation matrix are further investigated. We learn that and have a correlation coefficient of \(0.947\), and have a correlation coefficient of \(0.888\), and have a correlation coefficient of \(0.864\), and finally and have a correlation coefficient of \(-0.760\). To avoid redundancy in the model, we'll remove one predictor in each of the four pairs from the model. The terms' variance inflation factors will guide the decision regarding which predictor to drop. Using GVIF\(^\frac{1}{2df}\) will allow the GVIFs to be comparable across dimensions. Thus, we remove (8.77 > 6.25), (4.64 > 4.29), (8.42 > 7.42), and (4.29 > 3.55).

full = lm(logsal ~ ., data = data)
vif(full)[,3]
##           Age             G        gs_adj            MP           PER 
##      1.119102      3.000854      1.860765      6.240509      7.400986 
##         X3PAr       ftr_adj       orb_adj          DRB.          AST. 
##      2.495987      1.442517      2.554112      2.256841      3.072991 
##       stl_adj       blk_adj          TOV.          USG.          ORtg 
##      1.858169      2.364577      2.885804      5.059147      6.446025 
##          DRtg       ows_adj       dws_adj         WS.48          OBPM 
##      4.268697      3.069610      4.543134      8.393912      6.126404 
##          DBPM      vorp_adj       Pos_cat        Height       pts_adj 
##      3.542653      3.301813      1.577793      2.394156      8.743606 
##           TS. International    Conference 
##      4.631713      1.113687      1.081437
data <- data_init %>% dplyr::select(logsal, Age, G, gs_adj, MP, PER, X3PAr, ftr_adj, orb_adj, DRB., AST.,
                                    stl_adj, blk_adj, TOV., USG., ORtg, ows_adj, dws_adj,
                                    OBPM, DBPM, vorp_adj, Pos_cat, Height, International,
                                    Conference)

We will start by looking at a full model with all remaining predictors. We observe that in this case, Age, MP, vorp_adj, and Multi-team affiliation have very significant relationships with the log of salary, even after accounting for the other predictors. Other variables that have moderately significant relationships with the log of salary even after accounting for the other predictors are G, orb_adj, USG., and ows_adj.

#all_models = regsubsets(SALARY ~ ., force.in = 1, data = data, nbest = max(choose(n_predictors, 0:n_predictors)), really.big = T, nvmax = 13)
null = lm(logsal ~ 1, data = data)
full = lm(logsal ~ ., data = data)
summary(full)
## 
## Call:
## lm(formula = logsal ~ ., data = data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -3.12733 -0.46812  0.04338  0.55034  1.86549 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       8.7021392  2.9814207   2.919 0.003736 ** 
## Age               0.0658064  0.0106113   6.202 1.54e-09 ***
## G                -0.0181663  0.0055382  -3.280 0.001139 ** 
## gs_adj            0.0194856  0.0475846   0.409 0.682422    
## MP                0.0007226  0.0001837   3.933 0.000101 ***
## PER              -0.0871335  0.0465252  -1.873 0.061907 .  
## X3PAr             0.2832213  0.5158578   0.549 0.583327    
## ftr_adj          -0.4945672  0.3990974  -1.239 0.216078    
## orb_adj           0.3410770  0.1307090   2.609 0.009450 ** 
## DRB.             -0.0136510  0.0159678  -0.855 0.393174    
## AST.              0.0189639  0.0094084   2.016 0.044586 *  
## stl_adj           0.1826712  0.2372052   0.770 0.441750    
## blk_adj          -0.1199107  0.1824635  -0.657 0.511490    
## TOV.              0.0036909  0.0141669   0.261 0.794603    
## USG.              0.0728354  0.0251642   2.894 0.004032 ** 
## ORtg              0.0150725  0.0144326   1.044 0.297032    
## ows_adj           1.2644686  0.3969230   3.186 0.001571 ** 
## dws_adj           0.6322679  0.2681370   2.358 0.018911 *  
## OBPM              0.1038468  0.0754130   1.377 0.169360    
## DBPM              0.1310865  0.0565073   2.320 0.020913 *  
## vorp_adj         -0.9380214  0.1972369  -4.756 2.87e-06 ***
## Pos_catG         -0.4680186  0.2487174  -1.882 0.060684 .  
## Pos_catWing      -0.2256998  0.1571007  -1.437 0.151689    
## Height            0.0194139  0.0290128   0.669 0.503831    
## InternationalYes -0.0264050  0.1100976  -0.240 0.810597    
## ConferenceMulti  -0.6825231  0.1578069  -4.325 1.98e-05 ***
## ConferenceWest   -0.0928064  0.0952214  -0.975 0.330398    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.8297 on 358 degrees of freedom
## Multiple R-squared:  0.557,  Adjusted R-squared:  0.5249 
## F-statistic: 17.32 on 26 and 358 DF,  p-value: < 2.2e-16

We'll next run a stepwise model search using both AIC and BIC to provide another starting point:

>>>>>>> e275a16cbaf1da494b517254c95b62d4cb85c8f6
bic = log(nrow(data))
stepbic_model <- stepAIC(object = null, scope = list(lower = null, upper = full),
direction = "both", k = bic)
stepaic_model <- stepAIC(object = null, scope = list(lower = null, upper = full),
direction = "both", k = 2)
print(stepaic_model$call)
<<<<<<< HEAD
## lm(formula = logsal ~ pts_adj + Age + G + MP + stl_adj + Height + 
##     TS. + ORtg + Conference + TOV., data = data)
print(stepbic_model$call)
## lm(formula = logsal ~ pts_adj + Age + G + orb_adj + MP + AST. + 
##     stl_adj, data = data)

From this initial analysis, some important predictors appear to be Age, DBPM, Conference, Games, VORP, and Points. We can see that the ‘conference’ factor is really picking up on players that appeared on multiple teams. Sometimes these are good players who have been traded, but often these players are end of the bench guys that sign cheap, short term contracts. This having a relationship with salary would make sense. We created in our preprocessing script a new variable that just flags players that appeared on multiple teams within the season.

We can also see that Position has some redundancy and maybe too much granularity; for example “F-C” (forward/center) and “C-F” (center/forward) are treated as different positions by the model when functionally they are the same. We cleaned this up by grouping into “Guards”, “Wings”, and “Bigs”. From the graph it is unclear if there is a significant relationship between the refined position predictor and log(salary).

pos_box = data_init %>% ggplot(aes(x = Pos, y = log(SALARY))) + geom_boxplot() + ggtitle("log(salary) by position (unrefined)")
pos_cat_box = data_init %>% ggplot(aes(x = Pos_cat, y = log(SALARY))) + geom_boxplot() + ggtitle("log(salary) by position (refined)")
grid.arrange(pos_box, pos_cat_box, nrow = 1)

We’ll look at some diagnostic plots for these models to see where we’re at:

=======
## lm(formula = logsal ~ MP + Age + Conference + USG. + DBPM + orb_adj + 
##     Pos_cat + AST. + vorp_adj + OBPM + G + PER + ows_adj + dws_adj, 
##     data = data)
print(stepbic_model$call)
## lm(formula = logsal ~ MP + Age + Conference + USG. + DBPM, data = data)

From this initial analysis, some important predictors appear to be MP, Age, Conference, USG., and DBPM. We can see that the 'conference' factor is really picking up on players that appeared on multiple teams. Sometimes these are good players who have been traded, but often these players are end of the bench guys that sign cheap, short term contracts. This having a relationship with salary would make sense. We created in our preprocessing script a new variable that just flags players that appeared on multiple teams within the season.

We'll look at some diagnostic plots for these models to see where we're at:

>>>>>>> e275a16cbaf1da494b517254c95b62d4cb85c8f6
stepaic_resid_plot <- data.frame(resid=stepaic_model$residuals, fitted_logsal=stepaic_model$fitted.values) %>%  ggplot(aes(x=fitted_logsal, y=resid)) + geom_point() + ggtitle("Stepwise AIC model residual plot")
stepbic_resid_plot <- data.frame(resid=stepbic_model$residuals , fitted_logsal=stepbic_model$fitted.values) %>%  ggplot(aes(x=fitted_logsal, y=resid)) + geom_point() + ggtitle("Stepwise BIC model residual plot")

grid.arrange(stepaic_resid_plot, stepbic_resid_plot, nrow = 1)
<<<<<<< HEAD

We can see that there are some problems with these residual plots. The densest area of both plots appears to be following a downward trend, and the variance is not constant across all fitted values, i.e., these are not null plots. The models need improvement. It is also noteworthy that despite AIC including several more predictors, the residual plots are very similar.

=======

We can see that there are some problems with these residual plots. The densest areas of each plot appear to be following a downward trend, and the variance is not constant across all fitted values, i.e., these are not null plots. The models need improvement. It is also noteworthy that despite AIC including several more predictors, the residual plots are very similar.

>>>>>>> e275a16cbaf1da494b517254c95b62d4cb85c8f6

Added Variable Plots

After making the adjustments to a few of the predictor variables described above, we continue the iterative process of exploratory model building.

data_trimmed <- data_init %>% dplyr::select(logsal, Age, G, gs_adj, MP, PER, X3PAr, ftr_adj, orb_adj, DRB., AST., stl_adj, blk_adj, TOV., USG., ORtg, DRtg, ows_adj, dws_adj, WS.48, OBPM, DBPM, vorp_adj, Pos_cat, Height, pts_adj, TS., International, Multiteam)
trimmed_null = lm(logsal ~ 1, data = data_trimmed)
trimmed_full = lm(logsal ~ ., data = data_trimmed)
corrplot(cor(data_trimmed[sapply(data_trimmed, is.numeric)]), method="number")
<<<<<<< HEAD

trimmed_aic <- stepAIC(object = trimmed_null, scope = list(lower = trimmed_null, upper = trimmed_full),
direction = "both", k = 2)
summary(trimmed_full)
print(trimmed_aic$call)
trimmed_resid_plot <- data.frame(resid=trimmed_aic$residuals , fitted_logsal=trimmed_aic$fitted.values) %>%  ggplot(aes(x=fitted_logsal, y=resid)) + geom_point() + ggtitle("Stepwise AIC model residual plot")
# trimmed_resid_plot
=======

trimmed_aic <- stepAIC(object = trimmed_null, scope = list(lower = trimmed_null, upper = trimmed_full),
direction = "both", k = 2)
summary(trimmed_full)
## 
## Call:
## lm(formula = logsal ~ ., data = data_trimmed)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.97747 -0.44726  0.02749  0.55182  2.12532 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      10.7887241  5.7929454   1.862 0.063374 .  
## Age               0.0619535  0.0105083   5.896 8.69e-09 ***
## G                -0.0211991  0.0055688  -3.807 0.000166 ***
## gs_adj           -0.0041681  0.0482924  -0.086 0.931269    
## MP               -0.0001719  0.0003178  -0.541 0.588925    
## PER               0.0339671  0.0610374   0.556 0.578222    
## X3PAr             0.3452160  0.5324448   0.648 0.517171    
## ftr_adj          -0.3683558  0.4140906  -0.890 0.374308    
## orb_adj           0.1968777  0.1632128   1.206 0.228519    
## DRB.             -0.0263340  0.0160502  -1.641 0.101739    
## AST.              0.0049831  0.0138668   0.359 0.719542    
## stl_adj          -0.0978102  0.2632827  -0.372 0.710485    
## blk_adj          -0.3489851  0.1926931  -1.811 0.070972 .  
## TOV.              0.0267741  0.0249610   1.073 0.284162    
## USG.             -0.0129629  0.0417970  -0.310 0.756637    
## ORtg              0.0500782  0.0235417   2.127 0.034092 *  
## DRtg             -0.0297096  0.0450437  -0.660 0.509956    
## ows_adj           0.7769953  0.4639111   1.675 0.094838 .  
## dws_adj           0.7746716  0.3594889   2.155 0.031840 *  
## WS.48            -9.0274781  4.7809750  -1.888 0.059813 .  
## OBPM              0.0728867  0.0936530   0.778 0.436932    
## DBPM              0.1185353  0.0726130   1.632 0.103477    
## vorp_adj         -0.6613434  0.2286878  -2.892 0.004065 ** 
## Pos_catG         -0.4969772  0.2457386  -2.022 0.043886 *  
## Pos_catWing      -0.2122213  0.1554118  -1.366 0.172948    
## Height            0.0224363  0.0288497   0.778 0.437266    
## pts_adj           0.0998109  0.0362844   2.751 0.006250 ** 
## TS.              -4.2719066  3.1091010  -1.374 0.170308    
## InternationalYes -0.0032859  0.1099257  -0.030 0.976170    
## Multiteam        -0.5938917  0.1483221  -4.004 7.58e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.8198 on 355 degrees of freedom
## Multiple R-squared:  0.5712, Adjusted R-squared:  0.5361 
## F-statistic:  16.3 on 29 and 355 DF,  p-value: < 2.2e-16
print(trimmed_aic$call)
## lm(formula = logsal ~ pts_adj + Age + Multiteam + DBPM + G + 
##     vorp_adj + orb_adj + blk_adj + Pos_cat + AST. + DRB. + dws_adj + 
##     X3PAr + ows_adj, data = data_trimmed)
>>>>>>> e275a16cbaf1da494b517254c95b62d4cb85c8f6

Diagnostics

<<<<<<< HEAD =======

From this point, we can see that adjusted VORP, adjusted points, Age, G, and Multiteam are the most important predictors, with many others appearing to be useful. We'll look at residuals of a basic model and see where we're at:

trimmed_resid_plot <- data.frame(resid=trimmed_aic$residuals , fitted_logsal=trimmed_aic$fitted.values) %>%  ggplot(aes(x=fitted_logsal, y=resid)) + geom_point() + ggtitle("Stepwise AIC model residual plot")
trimmed_resid_plot

>>>>>>> e275a16cbaf1da494b517254c95b62d4cb85c8f6

The residual cloud is similar to those before we cut unnecessary predictors.

Leverage Points and Cook's Distances

Now, we'll look at some of the highest leverage cases.

show_diagnostics <- function(model){
  x = model.matrix(model)
  h = x %*% solve(t(x) %*% x) %*% t(x)
  leverages = diag(h)
  lev_resid_plot <- data.frame(resid=model$residuals , fitted_logsal=model$fitted.values, leverage=leverages) %>%  ggplot(aes(x=fitted_logsal, y=resid)) + geom_point(aes(size=leverages, color=leverages)) + ggtitle("Residual plot with leverages")
  cooks = cooks.distance(model)
  cook_resid_plot <- data.frame(resid=model$residuals , fitted_logsal=model$fitted.values, cook_distance=cooks) %>%  ggplot(aes(x=fitted_logsal, y=resid)) + geom_point(aes(size=cook_distance, color=cook_distance)) + ggtitle("Residual plot with cooks distance")
  print(lev_resid_plot)
  print(cook_resid_plot)
}

show_diagnostics(trimmed_aic)
<<<<<<< HEAD

=======

>>>>>>> e275a16cbaf1da494b517254c95b62d4cb85c8f6

Outliers

There's an outlying case with a large leverage and cooks distance; we'll use a t-test for a mean shift to formally test if this is an outlier:

test_outlier_meanshift <- function(model, ind){
  p = length(coef(model))
  n = nrow(model.matrix(model))
  ri = rstandard(model)[ind]
  t_stat = ri * sqrt(((n - p - 1) / (n - p - (ri ^ 2))))
  pval = pt(t_stat, n - p - 1)
  return (pval)
}

test_outlier_meanshift(trimmed_aic, which.max(cooks.distance(trimmed_aic)))
##         275 
## 0.009002786

We notice that the p-value for a mean shift for this case is small. Looking at the data for this particular observation, we see that it is an outlier on many levels - this player had 2 games played, 6 minutes, a wildly high PER and Usage rate, 0 for many counting stats (Assists, steals, etc.), and a very small salary of $30000. Many of these stats are unstable due to the very small number of minutes played. Also, this player is Thanasis Antetokounmpo, who is a unique case because many people consider him to be a non-NBA level player, who only got into the league because his brother Giannis is one of the best players in the world. Thus, we feel he is not a useful data point, and are comfortable removing him from the dataset. We refit a basic model without him.

data <- data_init %>% filter(Player != "Thanasis Antetokounmpo")

data_trimmed <- data %>% mutate(logsal = log(SALARY)) %>% dplyr::select(logsal, Age, G, gs_adj, MP, PER, X3PAr, ftr_adj, orb_adj, DRB., AST., stl_adj, blk_adj, TOV., USG., ORtg, DRtg, ows_adj, dws_adj, WS.48, OBPM, DBPM, vorp_adj, Pos_cat, Height, pts_adj, TS., International, Multiteam)
trimmed_null = lm(logsal ~ 1, data = data_trimmed)
trimmed_full = lm(logsal ~ ., data = data_trimmed)
vif(trimmed_full)
summary(trimmed_full)
trimmed_aic <- stepAIC(object = trimmed_null, scope = list(lower = trimmed_null, upper = trimmed_full),
direction = "both", k = 2)
show_diagnostics(trimmed_aic)
<<<<<<< HEAD

=======

>>>>>>> e275a16cbaf1da494b517254c95b62d4cb85c8f6

Variance Inflation Factors

We can also use VIFs to evaluate redundancy in data:

full = lm(logsal ~ ., data = data_trimmed)
vif(full)[,3]
##           Age             G        gs_adj            MP           PER 
##      1.107415      2.827530      1.820040      6.407954      7.666727 
##         X3PAr       ftr_adj       orb_adj          DRB.          AST. 
##      2.544538      1.486126      2.893810      2.278635      3.408641 
##       stl_adj       blk_adj          TOV.          USG.          ORtg 
##      1.810277      2.447309      2.866146      5.298603      6.323095 
##          DRtg       ows_adj       dws_adj         WS.48          OBPM 
##      4.243524      3.206647      4.577049      8.170424      6.002826 
##          DBPM      vorp_adj       Pos_cat        Height       pts_adj 
##      3.501521      3.428825      1.589720      2.398936      9.661666 
##           TS. International     Multiteam 
##      4.558321      1.120978      1.066601
#full2 = lm(logsal ~ . - WS.48 - PER - ORtg - PTS, data = data) # Sequentially removed highest VIF until next iteration lowered both Adj R2 and Mult R2

Model Fitting

Test for Interaction

Test for Higher Order

Model Validation

Final Model

Discussion

Conclusion